摘要 :
In this paper, we present an optimization flow for monolithic 3D ICs called Pin-3D Optimizer. Compared with the state-of-the-art RTL-to-GDS flows that rely on ad-hoc technology file tweaks and RC scaling, Pin-3D offers a streamlin...
展开
In this paper, we present an optimization flow for monolithic 3D ICs called Pin-3D Optimizer. Compared with the state-of-the-art RTL-to-GDS flows that rely on ad-hoc technology file tweaks and RC scaling, Pin-3D offers a streamlined method to run commercial 2D IC tools to obtain high-quality monolithic 3D IC designs. Specifically, Pin-3D supports effective legalization, routing, timing closure, and ECO optimization for monolithic 3D IC designs. We propose a novel optimization methodology where the cells in each tier of a 3D IC are optimized using cell data and constraints of the full 3D design. The optimizations in a tier also directly influence the timing, power in the other tiers, leading to better overall PPA of the 3D IC. With the help of two industry processors designed with a 28 nm technology node, we show that Pin-3D provides up-to 9.0% smaller wirelength and 88% smaller total negative slack than die-by-die M3D flows. We also observe up-to 8.7% lower power and 26% smaller wirelength than 2D ICs. In addition, Pin-3D is the first flow that supports routing and timing optimization in heterogeneous logic-on-logic monolithic 3D ICs. We demonstrate this capability by performing area-balanced tier partitioning, routing, and timing closure of a 3D design with different technologies on each die.
收起
摘要 :
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, co...
展开
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, congestion effort, etc.). This paper proposes a deep reinforcement learning (RL) framework to optimize the placement parameters of a commercial EDA tool. We build an autonomous agent that learns to tune parameters optimally without human intervention and domain knowledge, trained solely by RL from self-search. To generalize to unseen netlists, we use a mixture of handcrafted features from graph topology theory along with graph embeddings generated using unsupervised Graph Neural Networks. Our RL algorithms are chosen to overcome the sparsity of data and latency of placement runs. Our trained RL agent achieves up to 11% and 2.5% wirelength improvements on unseen netlists compared with a human engineer and a state-of-the-art tool auto-tuner, in just one placement iteration (20× and 50× less iterations).
收起
摘要 :
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, co...
展开
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, congestion effort, etc.). This paper proposes a deep reinforcement learning (RL) framework to optimize the placement parameters of a commercial EDA tool. We build an autonomous agent that learns to tune parameters optimally without human intervention and domain knowledge, trained solely by RL from self-search. To generalize to unseen netlists, we use a mixture of handcrafted features from graph topology theory along with graph embeddings generated using unsupervised Graph Neural Networks. Our RL algorithms are chosen to overcome the sparsity of data and latency of placement runs. Our trained RL agent achieves up to 11% and 2.5% wirelength improvements on unseen netlists compared with a human engineer and a state-of-the-art tool auto-tuner, in just one placement iteration (20× and 50× less iterations).
收起
摘要 :
In this paper, we present an optimization flow for monolithic 3D ICs called Pin-3D Optimizer. Compared with the state-of-the-art RTL-to-GDS flows that rely on ad-hoc technology file tweaks and RC scaling, Pin-3D offers a streamlin...
展开
In this paper, we present an optimization flow for monolithic 3D ICs called Pin-3D Optimizer. Compared with the state-of-the-art RTL-to-GDS flows that rely on ad-hoc technology file tweaks and RC scaling, Pin-3D offers a streamlined method to run commercial 2D IC tools to obtain high-quality monolithic 3D IC designs. Specifically, Pin-3D supports effective legalization, routing, timing closure, and ECO optimization for monolithic 3D IC designs. We propose a novel optimization methodology where the cells in each tier of a 3D IC are optimized using cell data and constraints of the full 3D design. The optimizations in a tier also directly influence the timing, power in the other tiers, leading to better overall PPA of the 3D IC. With the help of two industry processors designed with a 28 nm technology node, we show that Pin-3D provides up-to 9.0% smaller wirelength and 88% smaller total negative slack than die-by-die M3D flows. We also observe up-to 8.7% lower power and 26% smaller wirelength than 2D ICs. In addition, Pin-3D is the first flow that supports routing and timing optimization in heterogeneous logic-on-logic monolithic 3D ICs. We demonstrate this capability by performing area-balanced tier partitioning, routing, and timing closure of a 3D design with different technologies on each die.
收起
摘要 :
Monolithic 3D IC (M3D) can continue to improve power, performance, area and cost beyond traditional Moore's law scaling limitations by leveraging the third-dimension and fine-grained monolithic inter-tier vias (MIVs). Several rece...
展开
Monolithic 3D IC (M3D) can continue to improve power, performance, area and cost beyond traditional Moore's law scaling limitations by leveraging the third-dimension and fine-grained monolithic inter-tier vias (MIVs). Several recent studies present methodologies to implement M3D designs, but most, if not all of these studies implement top and bottom tier separately after partitioning, which results in inaccurate buffer insertion. In this paper, we present a new methodology called `Cascade2D' that utilizes design and micro-architecture insight to partition and implement an M3D design using 2D commercial tools. By modeling MIVs with sets of anchor cells and dummy wires, we implement and optimize both top and bottom tier simultaneously in a single 2D design. M3D designs of a commercial, in-order, 32-bit application processor at the foundry 28nm, 14/16nm and predictive 7nm technology nodes are implemented using this new methodology and we investigate the power, performance and area improvements over 2D designs. Our new methodology consistently outperforms the state-of-the-art M3D design flow with up to 4× better power savings. In the best case scenario, M3D designs from the Cascade2D flow show 25% better performance at iso-power and 20% lower power at isoperformance.
收起
摘要 :
Monolithic 3D IC (M3D) can continue to improve power, performance, area and cost beyond traditional Moore's law scaling limitations by leveraging the third-dimension and fine-grained monolithic inter-tier vias (MIVs). Several rece...
展开
Monolithic 3D IC (M3D) can continue to improve power, performance, area and cost beyond traditional Moore's law scaling limitations by leveraging the third-dimension and fine-grained monolithic inter-tier vias (MIVs). Several recent studies present methodologies to implement M3D designs, but most, if not all of these studies implement top and bottom tier separately after partitioning, which results in inaccurate buffer insertion. In this paper, we present a new methodology called `Cascade2D' that utilizes design and micro-architecture insight to partition and implement an M3D design using 2D commercial tools. By modeling MIVs with sets of anchor cells and dummy wires, we implement and optimize both top and bottom tier simultaneously in a single 2D design. M3D designs of a commercial, in-order, 32-bit application processor at the foundry 28nm, 14/16nm and predictive 7nm technology nodes are implemented using this new methodology and we investigate the power, performance and area improvements over 2D designs. Our new methodology consistently outperforms the state-of-the-art M3D design flow with up to 4× better power savings. In the best case scenario, M3D designs from the Cascade2D flow show 25% better performance at iso-power and 20% lower power at isoperformance.
收起
摘要 :
In this paper, we present a comprehensive study of full-chip power, performance, and area metric for monolithic 3D (M3D) IC designs at the 7nm technology node. We investigate the benefits of M3D designs using our predictive 7nm Fi...
展开
In this paper, we present a comprehensive study of full-chip power, performance, and area metric for monolithic 3D (M3D) IC designs at the 7nm technology node. We investigate the benefits of M3D designs using our predictive 7nm FinFET libraries. This paper outlines detailed iso-performance power comparisons between M3D and 2D full-chip GDSII designs using both 7nm high performance (HP) and low stand-by power (LSTP) library cells. We achieve significant wire-length and buffer reduction with 7nm HP M3D designs over 2D counterparts, thus more power saving at high iso-performance frequency. In addition, this power saving is also realized in 7nm LSTP M3D designs running at low iso-performance frequencies. We also study the impact of clock tree design on the clock power consumption in M3D designs. Lastly, we demonstrate the impact of clock tree partitioning on the total power of full-chip M3D designs. Our experiments show that 7nm HP and LSTP M3D designs outperform its 2D counterparts by 12% and 10% on average, respectively.
收起
摘要 :
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 arra...
展开
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 array of integer processing elements to execute 64 integer operations or 32 floating-point operations in parallel. In order to show the feasibility of the proposed architecture, we use MPEG4 decoder as an integer type application and physics engine for 3D graphics as a floating-point type application. We first analyze these applications and optimize their implementation on the proposed architecture at the system level. Then we implement the proposed architecture on an FPGA board, which decodes 12 frames per second of MPEG4 CIF images and execute up to 160 million floating-point operations per second for the physics engine at 20MHz clock frequency.
收起
摘要 :
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 arra...
展开
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 array of integer processing elements to execute 64 integer operations or 32 floating-point operations in parallel. In order to show the feasibility of the proposed architecture, we use MPEG4 decoder as an integer type application and physics engine for 3D graphics as a floating-point type application. We first analyze these applications and optimize their implementation on the proposed architecture at the system level. Then we implement the proposed architecture on an FPGA board, which decodes 12 frames per second of MPEG4 CIF images and execute up to 160 million floating-point operations per second for the physics engine at 20MHz clock frequency.
收起
摘要 :
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 arra...
展开
In this paper, we propose a coarse-grained reconfigurable architecture, which supports both integer type application domain and floating-point type application domain. Our coarse-grained reconfigurable architecture has an 8x8 array of integer processing elements to execute 64 integer operations or 32 floating-point operations in parallel. In order to show the feasibility of the proposed architecture, we use MPEG4 decoder as an integer type application and physics engine for 3D graphics as a floating-point type application. We first analyze these applications and optimize their implementation on the proposed architecture at the system level. Then we implement the proposed architecture on an FPGA board, which decodes 12 frames per second of MPEG4 CIF images and execute up to 160 million floating-point operations per second for the physics engine at 20MHz clock frequency.
收起